5. Main Analysis
Zip Code, Neighborhood and Land Value Data
One of our other main questions in this project was looking at how the wealth was distributed amongst each of the zipcodes in the five boroughs. Using the census data from the superzip dataset and mapping that to the zipcode shapefile document we found, we managed to visualize the wealth of each neighborhood. We will later be able to plot the license applications over this mapping of the wealth distribution.
map <- get_map("New York City", source = "google", maptype = "roadmap", zoom = 11, color="bw")
ggmap(map, base_layer = ggplot(aes(x = longitude, y = latitude, color = avg), data = census)) + geom_point(size = 3, alpha=0.7) + scale_color_viridis() + ggtitle("Average Salary of each Zipcode in New York")
This map shows all of the zipcodes and we can clearly see that by far the greatest incomes per zipcode are in the Upper East Side and Upper West Side. There appears to be some grey areaa if you look closely at the Upper West Side. This corresponds to about $150,000/per year on the color map. Then, midtown and downtown also have high incomes. After that, they all fall towards the bottom of the spectrum. Thus, I think it would be useful to make a map showing only Manhattan, and then the everything excluding Manhattan. This should give us a better idea of what areas outside of Manhattan might see more bar/cafe openings.
census_manhattan <- census %>% dplyr::filter(city.x == 'Manhattan')
ggmap(map, base_layer = ggplot(aes(x = longitude, y = latitude, color = avg), data = census_manhattan)) + geom_point(size = 4, alpha=0.7) + scale_color_viridis() + ggtitle("Average Salary of each Zipcode in Manhattan")
This graph shows all of the data filtered for Manhattan only. We can see that once we get to Harlem and above, the average median income per neighborhood drops heavily. The rest of the city is pretty similar.
brooklyn_map2 <- get_map("Brooklyn, NY", source = "google", zoom = 12, maptype="roadmap", color="bw")
census_brooklyn <- census %>% dplyr:: filter(city.x == 'Brooklyn')
ggmap(brooklyn_map2, base_layer = ggplot(aes(x = longitude, y = latitude, color = avg), data = census_brooklyn)) + geom_point(size = 4, alpha=0.7) + scale_color_viridis() + ggtitle("Average Salary of each Zipcode in Brooklyn")
This shows us the average salary throughout the zipcodes in Brooklyn and we can see that it is highest in general when the neighborhood is closer to Manhattan.
At this point, I had realized that it would be hard to make a combined map illustrating both the income data and licenses data by displaying the zips like this. Thus, I found zipcode shapefile data for NYC and imported it in order to make better maps to be used as a base layer for the licensing data. I also begun labelling the neighborhoods using the labels Marika had created.
plotting_zips1<-read.csv("Data/NY_shapefiles.csv")
ggmap(map) + geom_polygon(aes(fill = avg, x = long, y = lat, group = group), data = plotting_zips1, alpha = 0.8) + ggtitle("Distribution of Average Wealth per Neighborhood in New York")
Here we see that Manhattan has the highest wealth. It has the wealthiest areas but also some areas in Harlem and more uptown that are less wealthy than most neighborhoods in Queens, Brooklyn and the Bronx.
g_all <- ggmap(map) + geom_polygon(aes(fill = avg, x = long, y = lat, group = group), data = plotting_zips1, alpha = 0.8)
g_all + ggtitle("Distribution of Wealth in New York") + scale_fill_viridis()
Here we see a full distribution using the viridis color scheme. This will be used as the base for the overlaid plots later on and we will then filter by each borough.
Wealth by borough - Adam
plotting_zips_manhattan <- plotting_zips1 %>% filter(city.y == "New York")
plotting_zips_brooklyn <- plotting_zips1 %>% filter(city.y == "Brooklyn")
plotting_zips_bronx <- plotting_zips1 %>% filter(city.y == "Bronx")
plotting_zips_queens <- plotting_zips1 %>% filter(!(city.y %in% c("New York", "Brooklyn", "Bronx", "Staten Island")))
bronx_map <- get_map("Bronx, NY", source = "google", zoom = 12, maptype="roadmap", color="bw")
brooklyn_map <- get_map("Brooklyn, NY", source = "google", zoom = 12, maptype="roadmap", color="bw")
queens_map <- get_map("Astoria, NY", source = "google", zoom = 12, maptype="roadmap", color="bw")
We are not using Staten Island since we have no sidewalk cafe data for Staten Island to compare the wealth to.
g_manhattan <- ggmap(map) + geom_polygon(aes(fill = income., x = long, y = lat, group = group), data = plotting_zips_manhattan, alpha = 0.5)
g_manhattan + ggtitle("Distribution of Wealth in Manhattan") + geom_text_repel(data=manhattan_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3) + scale_fill_viridis()
This map show us the wealth in Manhattan alone by zipcode. We are now using the viridis color scale so that it is perceptually uniform and easier to see differences in color. This clearly shows that the income is very low at Harlem and above, including Morningside Heights. This is likely due to the fact that it is almost exclusively students living in Morningside Heights and Harlem is cheaper. Also, it is interesting to see that Lower East Side has a similar average income to Harlem. The wealthiest areas are by far Upper West Side and Upper East Side. We can also see that we are missing what seems to be data for two zipcodes in the Upper East side. This corresponds to some of the data that Marika had found was missing and still did not appear in the superzip dataset.
g_brooklyn <- ggmap(brooklyn_map) + geom_polygon(aes(fill = income., x = long, y = lat, group = group), data = plotting_zips_brooklyn, alpha = 0.5)
g_brooklyn + ggtitle("Distribution of Wealth in Brooklyn") + geom_text_repel(data=brooklyn_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3) + scale_fill_viridis()
## Warning: Removed 3 rows containing missing values (geom_text_repel).
By observing the income distribution in Brooklyn, we see that in general, the areas closer to Manhattan have the highest income such as Brooklyn Heights and Crown Heights.
g_bronx <- ggmap(bronx_map) + geom_polygon(aes(fill = income., x = long, y = lat, group = group), data = plotting_zips_bronx, alpha = 0.5)
g_bronx + ggtitle("Distribution of Wealth in Bronx") + geom_text_repel(data=bronx_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3) + scale_fill_viridis()
g_queens <- ggmap(queens_map) + geom_polygon(aes(fill = income., x = long, y = lat, group = group), data = plotting_zips_queens, alpha = 0.5)
g_queens + ggtitle("Distribution of Wealth in Queens") + geom_text_repel(data=queens_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3) + scale_fill_viridis()
## Warning: Removed 29 rows containing missing values (geom_text_repel).
Surprisingly, it seems that in Queens, some of the higher income areas actually appear further away from Manhattan like in Flushing and Forest Hills and Rego Park. However, there are some higher income areas closer to Manhattan as well like East Elmhurst.
Sidewalk Cafe License Data
Any business that operates a portion of a restaurant on a public sidewalk must obtain a Sidewalk Cafe License from New York City. These licenses must be renewed every two years and fall into three categories: enclosed, unenclosed, or small unenclosed sidewalk cafes.
Analyzing by Borough
First, to help better organize the sidewalk cafe licenses by borough, I added a new column called BOROUGH that is set to MANHATTAN, BROOKLYN, BRONX, or QUEENS. I had to manually check that only the cities in Queens had been called out specifically in the CITY column, so it was easy to distinguish them from BRONX or BROOKLYN.
sidewalks <- sidewalks %>% mutate(BOROUGH = ifelse(CITY=="NEW YORK"|CITY=="New York","MANHATTAN",ifelse(CITY=="BROOKLYN","BROOKLYN",ifelse(CITY=="BRONX","BRONX","QUEENS"))))
To get a better understanding of the distribution of these licenses, I have provided a bar graph by borough.
ggplot(sidewalks, aes(x=fct_infreq(BOROUGH)))+geom_bar(aes(fill=BOROUGH))+ggtitle("Frequency of Sidewalk Cafe Licenses by Borough")+xlab("Borough")+ylab("Frequency")+theme_fivethirtyeight()
Clearly Manhattan has the most license requests, followed by Brooklyn, then Queens and finally Bronx. Since at the moment we don’t have neighborhood information (everything in Manhattan is just classified as New York, Brooklyn has only Brooklyn, and Bronx has only the city of Bronx), we can only dive into the Queens data:
queens_cafes <- sidewalks %>% filter(BOROUGH=="QUEENS")
ggplot(queens_cafes, aes(x=fct_infreq(CITY)))+geom_bar(fill="purple")+ggtitle("Frequency of Sidewalk Cafe Licenses in Queens")+xlab("City / Neighborhood")+ylab("Frequency")+coord_flip()+theme_fivethirtyeight()
In Queens, a large percentage of license requests come from Astoria, followed by Long Island City and Forest Hills.
Next, in order to do date comparisons to ascertain which are the new applications vs. renewal applications, I had to convert certain date fields from strings (they were read in as string factors) into dates.
sidewalks$EXPIRATION_DATE<-as.Date(sidewalks$EXPIRATION_DATE, format="%m/%d/%Y")
sidewalks$APP_STATUS_DATE<-as.Date(sidewalks$APP_STATUS_DATE, format="%m/%d/%Y")
sidewalks$SUBMIT_DATE<-as.Date(sidewalks$SUBMIT_DATE, format="%m/%d/%Y")
Licenses by Status
The list of licenses includes active licenses, expired licenses, licenses for businesses that have closed (and are now inactive), licenses which are up for renewal as part of the two year process, or new requests for licenses. To better classify them, I created a new field called STATUS_CLASSIFICATION. Those licenses which are still active and not up for renewal are classified as “ACTIVE”. Those licenses that have been submitted for renewal (either because their expiration date is less than the latest application data, or that an active license is up for review) are classified as “RENEWAL”. Those licenses that are in the sheet but do not have a license number are classified as “NEW”, and the rest are marked as “OLD” to encompass inactive licenses that have not been acted upon.
sidewalks<-sidewalks %>% mutate(STATUS_CLASSIFICATION = ifelse(LIC_STATUS=="Active" & (APP_STATUS=="Application Approved" | APP_STATUS=="Application Review Completed"),"ACTIVE",ifelse(is.na(LICENSE_NBR),"NEW",ifelse((APP_STATUS_DATE>EXPIRATION_DATE | DPQA=="Issued Temp Op Letter") | (LIC_STATUS=="Active" & (APP_STATUS=="Pending Review" | APP_STATUS=="Submitted")),"RENEWAL","OLD"))))
Now that we have classified the status of the licenses, we are able to see how these classifications differ between the boroughs.
ggplot(sidewalks)+geom_mosaic(aes(x=product(STATUS_CLASSIFICATION,BOROUGH),fill=factor(STATUS_CLASSIFICATION)))+coord_flip()+labs(x="Borough",y="License Status", fill="License Designation")+ggtitle("Boroughs by License Status")+theme_fivethirtyeight()
The mosaic plot shows how Bronx and Brooklyn may be getting more new license requests as a percentage of total licenses. Bronx is also getting the highest percentage of renewal requests out of its inactive and active licenses. We can also take a look at the license designations by borough:
ggplot(sidewalks)+geom_mosaic(aes(x=product(BOROUGH,STATUS_CLASSIFICATION),fill=factor(BOROUGH)))+coord_flip()+labs(x="License Status",y="Borough", fill="Borough")+ggtitle("License Status by Borough")+theme_fivethirtyeight()
Looking at the data in this way, you can see how Brooklyn has the second-most new license requests, but how Manhattan still dominates in all license status categories.
Mapping Licenses
We can map the data to have a better view of where the datapoints lie. To get an overall picture, I selected a map centered on Long Island City in Queens so that we can get a good view of both Brooklyn and Bronx in addition to Manhattan.
map <- get_map( location = c(-73.9485424, 40.7454513), source = "google", zoom = 11, maptype="roadmap", color="bw")
Plotting each of the restaurants colored by their borough. You can see how Manhattan dominates in the number of sidewalk cafes, and how the sidewalk cafes in Brooklyn and Queens are largely concentrated in the areas closer to Manhattan.
ggmap(map)+geom_point(aes(x=LONGITUDE,y=LATITUDE, color=BOROUGH),data=sidewalks, alpha=0.3)+ggtitle("Distribution of all licenses (inactive or active) in NYC")
Even with alpha, it is difficult to tell exactly where the largest concentrations of sidewalk cafes lie. Using a density plot, we can better see the concentration of sidewalk cafes around mid-to lower Manhattan, and the clusters in Astoria and Williamsburg.
g <- ggmap(map)+stat_density2d(aes(x=LONGITUDE,y=LATITUDE, fill=..level..),data=sidewalks, geom="polygon", alpha=0.2)
g+scale_fill_gradient(low="yellow",high="red")+ggtitle("Heatmap of Sidewalk Cafe Licenses in NYC")
The next question we can ask is whether there are clear patterns to where new license requests are coming in from, where they are being renewed, or where they have expired without renewal.
ggmap(map)+geom_point(aes(x=LONGITUDE,y=LATITUDE, color=BOROUGH),data=sidewalks, alpha=0.4)+facet_wrap(~STATUS_CLASSIFICATION)+ggtitle("Sidewalk Cafe Licenses by Status")
However, since we are zoomed out a lot and looking at the entire set of licenses, it is difficult to tell the difference between the distributions of license requests. For this analysis, I am only concerned about active licenses (whether they are being renewed or not), and new requests that have not yet been approved. To do this, I created another STATUS_CLASSIFICATION that groups active renewal requests into the Active category, and sets all other non-new requests as “inactive”.
sidewalks <- sidewalks %>% mutate(STATUS_CLASSIFICATION2=ifelse(STATUS_CLASSIFICATION=="ACTIVE" | (STATUS_CLASSIFICATION=="RENEWAL" & LIC_STATUS=="Active"), "ACTIVE", ifelse(STATUS_CLASSIFICATION=="NEW","NEW","OLD")))
new_active <- sidewalks %>% filter(STATUS_CLASSIFICATION2=="ACTIVE" | STATUS_CLASSIFICATION2=="NEW")
ggmap(map)+geom_point(aes(x=LONGITUDE,y=LATITUDE, color=BOROUGH),data=new_active, alpha=0.3)+facet_wrap(~STATUS_CLASSIFICATION2)+ggtitle("Active and New Licenses in New York City")
Since we are quite zoomed out, it is difficult to see what exactly is happening with the New requests. However, if we zoomed in, we would lose information about the Bronx or lower Brooklyn. In order to determine whether there is any useful information there, I have zoomed in on the Bronx region.
Geographic Analysis - Bronx
bronx_map <- get_map("Bronx, NY", source = "google", zoom = 12, maptype="roadmap", color="bw")
ggmap(bronx_map)+geom_point(aes(x=LONGITUDE,y=LATITUDE, color=BOROUGH),data=new_active)+facet_wrap(~STATUS_CLASSIFICATION2)+ggtitle("Active and New Licenses in the Bronx")+theme_fivethirtyeight()
Since there are very few active or new licenses in the Bronx, I feel comfortable in zooming in on the rest of Manhattan / Brooklyn and Queens in order to be able to better see what is happening at the expense of Bronx.
Geographic Analysis - Manhattan
I want to take a closer look at what is happening in Manhattan. In order to do this at a more granular level, I will use my merged data with our master_zip document which lists the neighborhoods of each Borough by zip code. I also pulled out the year of the submission of the license application and saved it as SUBMIT_YEAR.
master_zip<-read.csv("Data/zips_master_no_missing_nbrh.csv", strip.white=TRUE)
sidewalks_nbh <- merge(sidewalks, master_zip, by="ZIP", all.x=TRUE)
sidewalks_nbh <- sidewalks_nbh %>% mutate(SUBMIT_YEAR=year(SUBMIT_DATE))
sidewalks_manhattan_active <- sidewalks_nbh %>% filter(BOROUGH=="MANHATTAN" & (STATUS_CLASSIFICATION2=="ACTIVE" | STATUS_CLASSIFICATION2=="NEW"))
manhattan_map <- get_map("Manhattan, NY", source="google", maptype="roadmap", zoom=12, color="bw")
I have also calculated the average locations of the neighborhoods in each borough based on their assigned Zip codes.
ggmap(manhattan_map)+geom_point(aes(x=LONGITUDE, y=LATITUDE, color=nbh), data=sidewalks_manhattan_active)+ggtitle("All active or new sidewalk cafes in Manhattan")
sidewalks_bronx<- sidewalks_nbh %>% filter(BOROUGH=="BRONX")
This view shows all of the distribution of all currently active or new sidewalk cafes in Manhattan. To get a better idea of density, we can also look at a heatmap of this view.
g <- ggmap(manhattan_map)+stat_density2d(aes(x=LONGITUDE,y=LATITUDE, fill=..level..),data=sidewalks_manhattan_active, geom="polygon", alpha=0.3)
g+scale_fill_gradient(low="yellow",high="red")+ggtitle("Heatmap of Sidewalk Cafe Licenses in Manhattan")+geom_text_repel(data=manhattan_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3)
This heatmap shows the highest concentration of sidewalk cafes in Greenwich Village and NoHo. To see the difference between active and new licenses, we can facet based on our previously calculated categorization.
g <- ggmap(manhattan_map)+stat_density2d(aes(x=LONGITUDE,y=LATITUDE, fill=..level..),data=sidewalks_manhattan_active, geom="polygon", alpha=0.3)+facet_wrap(~STATUS_CLASSIFICATION2)
g+scale_fill_gradient(low="yellow",high="red")+ggtitle("Heatmap of Sidewalk Cafe Licenses in Manhattan")+geom_text_repel(data=manhattan_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3)+theme_fivethirtyeight()+guides(fill=FALSE)
The heatmap view of the map is not a great one - it is difficult to tell the differences on the same scales in a heatmap. Looking at the same view, but by using geom_point again, we get the following view:
ggmap(manhattan_map)+geom_point(aes(x=LONGITUDE,y=LATITUDE, color=nbh),data=sidewalks_manhattan_active)+facet_wrap(~STATUS_CLASSIFICATION2)+ggtitle("Active and New Licenses in Manhattan") + guides(color=FALSE)+geom_text_repel(data=manhattan_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3)+theme_fivethirtyeight()
Once again, the data is not telling us too much since there are very few new requests. Perhaps a better way to categorize this data is to look at the years that applications were submitted, therefore ignoring whether a license is currently active or not.
year_bxp<-ggplot(sidewalks_nbh, aes(y=SUBMIT_YEAR,x=1))+geom_boxplot()+ggtitle("Boxplot of Submission Years")+theme_fivethirtyeight()
year_bar<-ggplot(sidewalks_nbh,aes(x=SUBMIT_YEAR))+geom_bar()+ggtitle("Bar Graph of Submission Years")+theme_fivethirtyeight()
grid.arrange(year_bxp, year_bar, ncol=2)
Since there are very few submissions between 2000 and 2014, I will be removing these outliers as potential mistakes where the submit dates were not updated once the licenses were renewed. Now we can look at a mosaic plot of the different neighborhoods and the years that the licenses were requested.
sidewalks_nbh<- sidewalks_nbh %>% filter(SUBMIT_YEAR>2014)
sidewalks_manhattan<- sidewalks_nbh %>% filter(BOROUGH=="MANHATTAN")
manhattan_counts <- sidewalks_manhattan %>% group_by(nbh, SUBMIT_YEAR) %>% summarise(count=n())
manhattan_counts$SUBMIT_YEAR<-as.factor(manhattan_counts$SUBMIT_YEAR)
manhattan_counts$SUBMIT_YEAR<-factor(manhattan_counts$SUBMIT_YEAR, c("2017", "2016","2015"))
ggplot(manhattan_counts, aes(x=reorder(nbh, count), y=count, fill=SUBMIT_YEAR))+geom_bar(stat="identity",position=position_dodge())+coord_flip()+xlab("Neighborhood")+ggtitle("License Requests per Neighborhood and Year")+theme_fivethirtyeight()
It makes sense that 2016 would have more applications than 2015, since many of the 2015 applications have been renewed by now. It looks like Greenwich Village, Upper West Side, and Upper Easy Side consistently have the highest number of requests throughout the three years. However, to get a better understanding of the distribution of areas, we can create three separate heatmaps.
map2015 <- ggmap(manhattan_map)+stat_density2d(aes(x=LONGITUDE,y=LATITUDE, fill=..level..),data=sidewalks_manhattan%>%filter(SUBMIT_YEAR=="2015"), geom="polygon", alpha=0.3)+scale_fill_gradient(low="yellow",high="red")+ggtitle("Manhattan Licenses Applied for in 2015")+geom_text_repel(data=manhattan_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3)
map2015
map2016<-ggmap(manhattan_map)+stat_density2d(aes(x=LONGITUDE,y=LATITUDE, fill=..level..),data=sidewalks_manhattan%>%filter(SUBMIT_YEAR=="2016"), geom="polygon", alpha=0.3)+scale_fill_gradient(low="yellow",high="red")+ggtitle("Manhattan Licenses Applied for in 2016")+geom_text_repel(data=manhattan_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3)+guides(fill=FALSE)
map2016
map2017 <- ggmap(manhattan_map)+stat_density2d(aes(x=LONGITUDE,y=LATITUDE, fill=..level..),data=sidewalks_manhattan%>%filter(SUBMIT_YEAR=="2017"), geom="polygon", alpha=0.3)+scale_fill_gradient(low="yellow",high="red")+ggtitle("Manhattan Licenses Applied for in 2017")+geom_text_repel(data=manhattan_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3)+guides(fill=FALSE)
map2017
Across all three years, you can see how the concentrated area of licenses remains in the Greenwich Village area. However, it looks like the concentration of applicants in the Upper West Side in 2017 has dropped. This map is more useful than the bar chart because it shows the location of the cafes that may be straddling two neighborhoods - information that is not apparent from the grouped bar chart.
We can do the same sort of analysis in all four boroughs that we have sidewalk cafe information about. To make this analysis easier to view, I have created a shiny app which can be found by following this link: ShinyCafe
sidewalks_brooklyn <- sidewalks_nbh %>% filter(BOROUGH=="BROOKLYN")
sidewalks_queens <- sidewalks_nbh %>% filter(BOROUGH=="QUEENS")
sidewalks_bronx <- sidewalks_nbh %>% filter(BOROUGH=="BRONX")
brooklyn_map <- get_map("Brooklyn, NY", source = "google", zoom = 12, maptype="roadmap", color="bw")
queens_map <- get_map("Astoria, NY", source = "google", zoom = 12, maptype="roadmap", color="bw")
The Shiny app has a brief blurb explaining what each borough can tell us about the movement of sidewalk cafe applications.
Next, I added Adam’s income distribution ploygons to the background of these maps to see what the relationship is to income.
ggmap(manhattan_map)+geom_polygon(aes(fill = income., x = long, y = lat, group = group), data = plotting_zips_manhattan, alpha = 0.7)+geom_point(aes(x=LONGITUDE,y=LATITUDE, color="purple"),data=sidewalks_manhattan, size=1)+ scale_fill_viridis()+geom_text_repel(data=manhattan_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3)+guides(color=F)
Manhattan has a pretty clear clustering of sidewalk cafes in the higher income areas, with the exception of Lower Mahattan and the Financial District.
ggmap(brooklyn_map)+geom_polygon(aes(fill = income., x = long, y = lat, group = group), data = plotting_zips_brooklyn, alpha = 0.7)+geom_point(aes(x=LONGITUDE,y=LATITUDE, color="purple"),data=sidewalks_brooklyn, size=1)+ scale_fill_viridis()+guides(color=FALSE)
The missing data in Williamsburg is hindering this analysis, but we do see how the average income in Brooklyn Heights, Park Slope, and Red Hook correlate with more sidewalk cafes.
ggmap(queens_map)+geom_polygon(aes(fill = income., x = long, y = lat, group = group), data = plotting_zips_queens, alpha = 0.7)+geom_point(aes(x=LONGITUDE,y=LATITUDE, color="purple"),data=sidewalks_queens, size=1)+ scale_fill_viridis()+geom_text_repel(data=queens_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3)+guides(color=F)
Astoria and Long Island City are interesting cases with relatively low average incomes but a high concentration of sidewalk cafes. The other area to the south-east, around Parkside and Forest Hills, has a high concentration of cafes and high income.
ggmap(bronx_map)+geom_polygon(aes(fill = income., x = long, y = lat, group = group), data = plotting_zips_bronx, alpha = 0.7)+geom_point(aes(x=LONGITUDE,y=LATITUDE, color="purple"),data=sidewalks_bronx, size=1)+ scale_fill_viridis()+geom_text_repel(data=bronx_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3)+guides(color=F)
With the very low number of Bronx datapoints, it is not easy to see a realtionship between income and where the sidewalk cafes are located.
mean_incomes <- plotting_zips1 %>% group_by(ZIP) %>% summarise(mean_income=mean(income.))
zip_sidewalk <- sidewalks_nbh %>% group_by(ZIP, BOROUGH) %>% summarise(n_cafes=n())
summary_cafe <- merge(zip_sidewalk, mean_incomes, by="ZIP")
ggplot(summary_cafe, aes(x=mean_income, y=n_cafes))+geom_point()+xlab("Average Income (in Thousands of Dollars)")+ylab("Number of Sidewalk Cafes")+geom_smooth(method=lm)+theme_fivethirtyeight()
cafe.lm = lm(mean_income~n_cafes, data=summary_cafe)
summary(cafe.lm)$r.squared
## [1] 0.1965228
On all the data, there is pretty weak correlation between income and number of sidewalk cafes, with an r-squares of 0.196.
ggplot(summary_cafe, aes(x=mean_income, y=n_cafes))+geom_point()+xlab("Average Income (in Thousands of Dollars)")+ylab("Number of Sidewalk Cafes")+geom_smooth(method=lm)+facet_wrap(~BOROUGH, scales="free")+theme_fivethirtyeight()
Faceting by Borough, we can see how the relationship is actually negative in Bronx and Queens, but positive in Manhattan and Brooklyn (with the strongest relationship in Manhattan).
Putting it All Together
Our main graphs summarizing the data can be seen in the Executive Summary. However, we found that density countour maps of both of the liquor data and sidewalk cafe data tell a powerful story from 2015 to 2017.
liquor<-read.csv("Data/liquor_licenses/massaged_data.csv")
liquor<-liquor %>% filter(Classification=="LIQUOR")
hdr<-c("LONGITUDE","LATITUDE","YEAR","BOROUGH")
side<-sidewalks_nbh %>% filter(SUBMIT_YEAR>2014 & SUBMIT_YEAR<2018)%>%select(LONGITUDE, LATITUDE,SUBMIT_YEAR, BOROUGH)
liq<-liquor %>% filter(effective_year>2014 & effective_year<2018)%>%select(Longitude, Latitude,effective_year, BOROUGH)
colnames(liq)<-hdr
colnames(side)<-hdr
side<-side%>%mutate(TYPE="Sidewalk Cafe")
liq <-liq %>% mutate(TYPE="Bar")
total <- rbind(liq,side)
ggmap(map)+stat_density2d(aes(x=LONGITUDE,y=LATITUDE, color=TYPE),size=2, alpha=.5,contour=TRUE,data=total, geom="density2d")+xlab("") + ylab("")+theme(axis.text.x=element_blank(),axis.text.y=element_blank(),axis.ticks.x=element_blank(),axis.ticks.y=element_blank(),axis.line = element_line(color = NA))+ guides(fill=FALSE,alpha=FALSE)+facet_wrap(~YEAR)+theme_fivethirtyeight()
The graph shows how in 2015, most establishments are concentrated mostly in Manhattan, with bubbles in Brooklyn Heights, Williamsburg, and Astoria. However, in 2016, you can see the bars density contours spread out to neighborhoods next to Manhattan, especially in Brooklyn. In 2017, we are seeing the same happen with sidewalk cafes.
Liquor License Data
First I checked the distribution of liquor licenses by zip code.
temp <- nyc_liquor_licenses %>% group_by(mod_zip) %>% summarise(num = n())
qplot(temp$num, binwidth = 10)+theme_fivethirtyeight()
It is not surprising that this is a long tailed distribution, however it is very surprising that some zip codes have very high concentrations of liquor licenses. I looked into the zip codes with especially high number of licenses.
nyc_liquor_licenses %>% group_by(mod_zip, nbh) %>% summarise(num = n()) %>% filter(num > 400)
## Source: local data frame [5 x 3]
## Groups: mod_zip [5]
##
## mod_zip nbh num
## <dbl> <fctr> <int>
## 1 10001 Midtown 421
## 2 10003 Greenwich Village 541
## 3 10019 Midtown 476
## 4 10036 Midtown 445
## 5 11211 Greenpoint & Williamsburg 477
Unsurprisingly, 3 of these 5 highly concentrated areas are in Midtown. One interesting thing I found while investigating these zip codes were the singular premises with the highest concentrations.
nyc_liquor_licenses %>% group_by(Actual.Address.of.Premises..Address1.) %>% summarise(num = n()) %>% filter(num>15)
## # A tibble: 2 × 2
## Actual.Address.of.Premises..Address1. num
## <fctr> <int>
## 1 GRAND CENTRAL TERMINAL 101
## 2 PENN STATION 76
Both of these are transportation hubs. As a frequent commuter from Grand Central, I seemed very surprised that there were 101 places there that sell alcohol.
nyc_liquor_licenses %>% filter(Actual.Address.of.Premises..Address1. == "GRAND CENTRAL TERMINAL") %>% group_by(Premises.Name) %>% summarise(num = n())
## # A tibble: 3 × 2
## Premises.Name num
## <fctr> <int>
## 1 COMPASS ROSE SUSHI LLC 1
## 2 GRAND CENTRAL OYSTER BAR INC 1
## 3 METRO NORTH COMMUTER RAILROAD 99
This was even more surprising to me as I have never been able to purchase alcohol on Metro North. Upon further inspection
nyc_liquor_licenses %>% filter(Actual.Address.of.Premises..Address1. == "GRAND CENTRAL TERMINAL") %>%
filter(Premises.Name == 'METRO NORTH COMMUTER RAILROAD') %>% group_by(Agency.Zone.Office.Name) %>%
summarise(num = n())
## # A tibble: 1 × 2
## Agency.Zone.Office.Name num
## <fctr> <int>
## 1 Albany 99
Albany issues all the licenses. I have never been on a train to Albany, but next time I am, I will definitely ask for an adult beverage.
This made me want to know more about the different types of liquor licenses that are granted.
nyc_liquor_licenses %>% ggplot(., aes(x=fct_infreq(License.Type.Name))) + geom_bar() + coord_flip() +theme_fivethirtyeight()
There are a lot of fringe categories such as Distiller B-1 and Distiller A that are offered to a few places. However, the active liquor licenses are unsurprisingly dominated by on-premise liquor establishments, sometimes known as bars. There is also a distinction between beer and wine. I would like to explore that further.
Another basic variable I wanted to investigate was the how effective liquor licences have grown over the years.
nyc_liquor_licenses %>% group_by(effective_year) %>% summarise(num = n()) %>%
ggplot(., aes(x=effective_year, y = num)) + geom_bar(stat='identity')+theme_fivethirtyeight()
The biggest years represented are 2015 and 2016. I am not surprised by the fact that 2017 has fewer records. The year is only 1/3 done, and I don’t know how quickly it takes an active liquor license to be successfully recorded so it might be possible that liquor licenses that have been granted in Februrary and March are not included in the dataset. I do want to look a little further at 2014.
My first hypothesis is that only licenses in later months were included
nyc_liquor_licenses %>% mutate(month = months(License.Effective.Date)) %>%
mutate(month_ind = month(License.Effective.Date)) %>% group_by(month, month_ind, effective_year) %>%
summarise(num = n()) %>% ggplot(., aes(x = reorder(month, -month_ind), y = num)) + geom_bar(stat='identity') +
facet_wrap(~effective_year) + coord_flip()+theme_fivethirtyeight()
Although this data would be better shown in a time-series plot, the pattern is clear. The distributon of bars in 2014 by month does not suggest that my hypothesis is correct as the number of active licenses from March to December in that year is roughly constant and all other years other than 2017 show that licenses that are effective in January and February are fewer in comparison to other months. It might be coincidental, but in 2014-2016, March and October have a slightly higher number of licenses that become effective that month.
I then decided to look into license type.
nyc_liquor_licenses %>% filter(effective_year == 2014) %>% ggplot(., aes(x=fct_infreq(License.Type.Name))) + geom_bar() + coord_flip()+theme_fivethirtyeight()
Astonishingly, there are way fewer types of licenses in 2014. Additionally, there are almost no on-premise liquor licenses. My guess as to why this happens is that the liquor licenses for bars expire quicker. I wanted to look into how long licenses are effective for.
temp <- nyc_liquor_licenses %>% mutate(active_length = License.Expiration.Date - License.Effective.Date)
qplot(temp$active_length, binwidth = 7)+theme_fivethirtyeight()
Essentially, there seem to be two huge spikes which seem to correspond to 2 and 3 years
nyc_liquor_licenses %>% mutate(active_length = License.Expiration.Date - License.Effective.Date) %>%
group_by(active_length) %>% summarise(num = n()) %>% arrange(desc(num)) %>% head
## # A tibble: 6 × 2
## active_length num
## <time> <int>
## 1 730 days 4245
## 2 1095 days 3595
## 3 729 days 3082
## 4 1094 days 2290
## 5 727 days 177
## 6 364 days 166
By far, the most common lengths of licenses are 730 and 1095 which are 2 and 3 years respectively. Looking at the graph, the rest of the lengths are close to these lengths
nyc_liquor_licenses %>% filter(License.Type.Name == 'ON-PREMISES LIQUOR') %>%
mutate(active_length = License.Expiration.Date - License.Effective.Date) %>%
qplot(x = active_length, data = ., binwidth = 7)+theme_fivethirtyeight()
It does look like on-premise liquor licenses are for two years.
nyc_liquor_licenses %>% filter(License.Type.Name == 'GROCERY BEER, WINE PROD') %>%
mutate(active_length = License.Expiration.Date - License.Effective.Date) %>%
qplot(x = active_length, data = ., binwidth = 7)+theme_fivethirtyeight()
In contrast, grocery licenses look to be for around 3 years
The month plot made me question how licenses become active during the month.
nyc_liquor_licenses %>% mutate(dom = day(License.Effective.Date)) %>% qplot(x=dom, data=.)+theme_fivethirtyeight()
And for expiring date,
nyc_liquor_licenses %>% mutate(dom = day(License.Expiration.Date)) %>% qplot(x=dom, data=.)+theme_fivethirtyeight()
From these two plots, we can see that licenses seem to become effective at the beginning of the month and expire at the end of the month,
I also wanted to look at the spatial distribution.
ggmap(manhattan_map)+stat_density2d(aes(x=Longitude,y=Latitude, fill=..level..),data=nyc_liquor_licenses%>%filter(effective_year=="2015") %>%
filter(city.x == "Manhattan"), geom="polygon", alpha=0.3)+scale_fill_gradient(low="yellow",high="red")
ggmap(manhattan_map)+stat_density2d(aes(x=Longitude,y=Latitude, fill=..level..),data=nyc_liquor_licenses%>%filter(effective_year=="2016") %>%
filter(city.x == "Manhattan"), geom="polygon", alpha=0.3)+scale_fill_gradient(low="yellow",high="red")
In these two maps we see the spatial distribution between two years in which the license became effective. They are both very similar with the Midtown and Lower East Side having the highest density. The biggest difference between the two is that Morningside heights and north has more activity in 2016. Again, this isn’t to say that new places popped up in these neighborhoods, but licensed became active in this year.
ggmap(manhattan_map)+stat_density2d(aes(x=Longitude,y=Latitude, fill=..level..),
data=nyc_liquor_licenses%>% filter(city.x == "Manhattan") %>%
filter(License.Type.Name == "EATING PLACE BEER"), geom="polygon", alpha=0.3)+scale_fill_gradient(low="yellow",high="red")
ggmap(manhattan_map)+stat_density2d(aes(x=Longitude,y=Latitude, fill=..level..),
data=nyc_liquor_licenses%>% filter(city.x == "Manhattan") %>%
filter(License.Type.Name == "RESTAURANT WINE"), geom="polygon", alpha=0.3)+scale_fill_gradient(low="yellow",high="red")
These two maps compare wine in restaurants to beer in eating places. In the top map, the hotspot for beer stretches from Midtown all the way down to the Lower East Side. Additionally, outside of the upper east side and upper west side, the map is pretty much covered. In contrast, for wine, the concentration is clearly concentrated in the South and the East. Furthermore, the areas not covered in the beer map are covered in the wine map outside of Central Park. These areas tend to be more affluent, so this gives some evidence that wine is enjoyed more by more affluent people. ## 6. Conclusion
This project delved into exploring where sidewalk cafe and liquor licenses are issued and the geospatial relationship between these variables and income. Intuitively, areas generally associated as up and coming neighborhoods such as WIlliamsburg were found to see increases in the past two years. However, in Queens, there was found to be a negative correlation between income and sidewalk cafe and liquor licenses.
All of these conclusions should be taken with a grain of salt. First, we can’t be sure that the data for either sidewalk cafes or liquor licenses is complete. There might be places who operate without a license or places that aren’t recorded properly in the dataset. One such example was found in the liquor license data. Additionally, only two liquor licenses were granted to non rail-cars in Grand Central. This seems to be underestimating the true value. There were also two missing zip codes in the data and 15 zip codes that were not found in the liquor license data, giving further evidence that this dataset was incomplete.
The income data was collected by the census. While this data should be high quality, the most recent census was conducted in 2010 which means data might not be recent, especially considering the analysis focused on the years 2015-2016. Also, we used when the license was issued as a proxy for which neighborhoods were growing. In the sidewalk cafe data, renewed licenses overwrote the existing license for the place. In the liquor license data, it was seen that licenses had to renewed every two years for bars. Without access to all the historical data it is hard to separate growth from the renewal of licenses.
In terms of correlation, some neighborhoods such as MIdtown were found to have a high number of liquor licenses and sidewalk cafes. These correlations might be spurious as more people tend to live in these areas. This analysis only focused on how income affected sidewalk cafes and bars popping up, but the real cause might be something we didn’t account for such as demographic information, ease of transportation, etc.
For an introductory study, we were able to identify several trends in how licenses are issued, how wealth is distributed geographically across New York City, and correlations between these two variables. Without doing any modelling, we were able to generate and assess the validity of several hypotheses, and some of our preconceived notions were proven wrong. Hopefully, this study was able to show promise in how visualizing and mapping licenses can help provide insight about the what drives certain institutions to grow.